set.seed(394)
n_individuals <- 1000
n_snps <- 15
n_populations <- 3 # populations (elevations)
# population labels (equal group sizes for PCA clarity)
popidx <- rep(1:n_populations, each = ceiling(n_individuals/n_populations))[1:n_individuals]
#MAF patterns for 3 SNP types (Does this make sense?)
maf1 <- c(0, 0.8, 0.7) # Type 1: Only in populations 2 and 3
maf2 <- c(0.3, 0.5, 0.6) # Type 2: Common in all, but varying
maf3 <- c(0.3, 0.1, 0.5) # Type 3: Mostly in population 3
# Function to simulate SNPs with population-specific MAFs
sim_data_onevar <- function(pop, maf) {
rbinom(n = length(pop), size = 2, prob = maf[pop])
}
# Simulate SNPs for each type
snps1 <- replicate(5, sim_data_onevar(pop = popidx, maf = maf1)) # 5 SNPs of type 1
snps2 <- replicate(5, sim_data_onevar(pop = popidx, maf = maf2)) # 5 SNPs of type 2
snps3 <- replicate(5, sim_data_onevar(pop = popidx, maf = maf3)) # 5 SNPs of type 3
# Combine all SNPs into a matrix
snp_data <- cbind(snps1, snps2, snps3)Simulation Study on Climate Adaptation using PCA
Content Summary 2
🌡 Simulation Study on Climate Adaptation using PCA 🌡
We will study the Tree Climate Adaptation by simulating SNPs data associated to Carbon Release rate.
Research Question
Primary Question: Do studied tree populations have SNPs associated with adaptive traits (e.g., metabolic rate, CO2 release) that allow them to cope with changing temperatures?
Secondary Question: Can we identify specific SNPs that contribute to reduced CO2 emissions in trees from lower elevations, where temperatures have increased the most due to climate change?
Hypotheses
Hypothesis 1: Trees from different populations wil not show any SNPs association with reduced metabolic rates (lower CO2 release) as an adaptation to higher temperatures.
Hypothesis 2: The identified SNPs are linked to genes involved in stress response, carbon metabolism, or mitochondrial efficiency.
To answer this questions we will conduct a simulation study.
Our Dataset
🌳 Simulating Tree Genetic Data 🌳
We will generate:
- 1000 tress with random SNPs (0, 1, 2 copies of a minor allele).
- A tertiary CO2 release rate trait (higher, med, low???) influenced by a SNP.
- 3 tree species at different elevations which are our populations.
elevation <- ifelse(
popidx == 1, runif(n_individuals, 100, 300), # Low elevation
ifelse(
popidx == 2, runif(n_individuals, 300, 600), # Mid elevation
runif(n_individuals, 600, 1000) # High elevation
)
)
elevation_category <- case_when(
elevation < 300 ~ "low",
elevation < 600 ~ "mid",
elevation >= 600 ~ "high"
)# effect sizes to SNPs (adjust indices based on SNP types)
genetic_effect <-
0.5 * snp_data[, 1] + # SNP1 (type 1)
0.3 * snp_data[, 6] - # SNP6 (type 2)
0.4 * snp_data[, 11] # SNP11 (type 3)
# environmental effect (higher elevation = lower CO2 release) (Confounder, no?)
environmental_effect <- -0.01 * elevation
# make trait and add noise
trait <- genetic_effect + environmental_effect + rnorm(n_individuals, mean = 0, sd = 0.5)tree_pcadata <- data.frame(
population = as.factor(popidx),
elevation = elevation,
elevation_category = as.factor(elevation_category),
trait = trait,
snp_data
)
# Name SNP columns
colnames(tree_pcadata)[5:(ncol(tree_pcadata))] <- paste0("SNP", 1:ncol(snp_data))Looking at the data, do you notice that there are variables or other factors that might influence our results other than tree populations?
head(tree_pcadata) population elevation elevation_category trait SNP1 SNP2 SNP3 SNP4 SNP5
1 1 156.3270 low -2.204852 0 0 0 0 0
2 1 210.7208 low -2.137725 0 0 0 0 0
3 1 156.7825 low -1.525669 0 0 0 0 0
4 1 196.8221 low -1.504617 0 0 0 0 0
5 1 153.4105 low -1.494680 0 0 0 0 0
6 1 281.4923 low -3.128641 0 0 0 0 0
SNP6 SNP7 SNP8 SNP9 SNP10 SNP11 SNP12 SNP13 SNP14 SNP15
1 1 0 0 0 0 1 0 0 1 1
2 0 0 0 1 0 1 2 0 0 1
3 0 2 1 1 0 0 1 1 1 0
4 1 1 2 0 1 1 0 0 1 0
5 0 1 0 1 0 1 0 1 1 1
6 0 1 0 0 0 0 0 1 1 1
dim(tree_pcadata)[1] 1000 19
Why do we care about this other factor? How could it play a role in our experiment?
🚩 Confounding in GWAS 🚩
“A confounder is a common cause of both the causal variable of interest and the outcome (e.g. living area could be a confounder of fireplace presence and house price)” (STAT 155 Notes, Macalester Statistics Faculty).
A variable \(X_2\) is a confounder if: - causally associated with the outcome Y in the population - causally associated with the predictor of interest \(X_1\) in the population (i.e., \(X_2\) is a common cause of both the outcome and predictor of interest)
Generally in GWAS…
🧬 Genetic Variation Across Elevations 🧬
Environmental gradients (e.g., elevation, climate) can drive spatially structured genetic variation through local adaptation. For example, populations at high altitudes often exhibit allele frequency shifts in genes like EPAS1, linked to hypoxia tolerance (e.g., Tibetan populations as showed by Simonson et al., 2010).
In GWAS, such geographic or environmental clines can confound SNP-trait associations if unmodeled.
PCA helps to:
Detect continuous genetic gradients (e.g., allele frequency changes correlated with elevation).
Adjust for confounding by including top principal components (PCs) as covariates, isolating true SNP-trait signals from spatially structured noise.
Specifically for this study…
The question marks on our Causal Graph that go from Elevation to Tree genus, and Elevation to Metabolic rate indicate that there might be a causality relationship that is associated with both the treatment and the outcome of our experiment.
Temperatures are very low at high altitudes and can remain below 0 °C during summer -> Low temperatures are identified as critical limitations to plant biochemical processes and physiological activities (Larcher and Bauer 1981, cited by Zheng et al.,2021).
Evolutionary theories talk about how environmental conditions lead to speciation throuhg natural selection, highlighting how genetic variation play a role on this process (Stebbins, 1950)
Why do we want to use PCA (Principal Component Analysis)?
PCA identifies axes of genetic variation (principal components) that capture ancestry. These axes/ancestry often correlate with unmeasured confounders, like in our experiment geographic elevation. By including top PCs as covariates in GWAS, we adjust for latent population structure, reducing spurious SNP-trait associations (Price et al., 2010; National Academies, 2023).
🌱 Population Structure 🌱
Population Structure refers all non-random genetic variation across human groups that are shaped by historical demography (e.g., migration, drift, etc) are what defines population structure (Price, A. et al., 2010). In GWAS, we study the association between certain SNPs to a trait, therefore when we have variation that is connected to a human group, wheather or no there is association with the trait, it Introduces spurious (false) associations between genetic variants and traits (National Academies, 2023; Bryc, K. 2015). This is a current challenge that GWAS tries to address by accounting for it and developing methods that address population structure, family relateness and cryptic relateness (Price, A. et al., 2010).
Exploring the Data
tree_pcadata %>%
group_by(population) %>%
count()# A tibble: 3 × 2
# Groups: population [3]
population n
<fct> <int>
1 1 334
2 2 334
3 3 332
Are there any SNPs that seem to be more common in one population than the other? By how much do the allele frequencies in the two populations differ for each SNP?
Needs editing - interpret tree_pcadata
Running PCA
tree_geno <- tree_pcadata %>%
select(-population, -trait, -elevation, -elevation_category)
# Run PCA on SNP data
tree_pca_out <- prcomp(tree_geno, center = TRUE, scale = TRUE)Extracting Loadings and Scores
# pull out scores, loadings, and variance
scores_tree_pca <- tree_pca_out$x
loadings_tree_pca<- tree_pca_out$rotation
var_tree_pca <- (tree_pca_out$sdev)^2
Plot Results
Scores
colnames(scores_tree_pca) [1] "PC1" "PC2" "PC3" "PC4" "PC5" "PC6" "PC7" "PC8" "PC9" "PC10"
[11] "PC11" "PC12" "PC13" "PC14" "PC15"
scores_tree_pca %>%
as.data.frame() %>% # convert pca_scores into a data frame for plotting
mutate(population = as.factor(tree_pcadata$population)) %>% # add the population labels
ggplot(aes(x = PC1, y = PC2, color = population)) + # then plot
geom_point() +
scale_color_brewer(palette = 'Dark2')+
theme_minimal()scores_tree_pca %>%
as.data.frame() %>% # convert pca_scores into a data frame for plotting
mutate(elevation_categorical = as.factor(tree_pcadata$elevation_category)) %>% # add the population labels
ggplot(aes(x = PC1, y = PC2, color = elevation_category)) + # then plot
geom_point() +
scale_color_brewer(palette = 'Dark2')+
theme_minimal()Parallel Coordinates Plot
# parallel coordinates plot
scores_tree_pca %>%
as.data.frame() %>%
mutate(population = as.factor(tree_pcadata$population)) %>%
ggparcoord(columns = 1:15, groupColumn = 'population', alpha = 0.2) +
theme_minimal() +
scale_color_brewer(palette = 'Dark2')Loadings Plot
pc1_loadings <- tree_pca_out$rotation[, 1]
loadings_df <- data.frame(
SNP = rownames(tree_pca_out$rotation), # SNP names
Loading = pc1_loadings # Loadings for PC1
)
ggplot(loadings_df, aes(x = reorder(SNP, Loading), y = Loading, fill = Loading)) +
geom_bar(stat = "identity") +
labs(x = "SNP", y = "Loading", title = "Loadings of SNPs for PC1",
fill = "Loading"
) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
Variance Explained
What proportion of total variance each PC explains, we can create what’s known as a scree plot. The code chunk below calculates the proportion of variance explained.
# calculate proportion of variance explained
tree_total_var <- sum(var_tree_pca)
tree_pve <- var_tree_pca/tree_total_var
Visualizing the Proportion of variance explained compares across the PCs
# SNPs
tree_pve %>%
as.data.frame() %>%
mutate(index = seq_len(length(var_tree_pca))) %>%
ggplot(aes(x = index, y = tree_pve)) +
geom_point() +
geom_line() +
labs(x = 'SNP Number', y = 'Percent of Variance Explained') +
theme_minimal()# PCs
# Create data frame for plotting
tree_variance_df <- data.frame(
PC = factor(1:length(tree_pve)),
Individual = tree_pve,
Cumulative = cumsum(tree_pve)
)
# Create scree plot
ggplot(tree_variance_df, aes(x = PC)) +
geom_col(aes(y = Individual)) +
geom_point(aes(y = Cumulative)) +
geom_line(aes(y = Cumulative, group = 1)) +
scale_y_continuous(labels = scales::percent) +
labs(x = "Principal Component", y = "Proportion of Variance Explained", title = "Scree Plot with Cumulative Variance") +
theme_minimal() +
theme(panel.grid.major.x = element_blank())What happens if we would have not realized about the confounding on our data?
Omitted Variable Bias
Reminder: Bias is how long or how far (on average) we are to the “truth”.
In this context we are talking about parameters, so…
If \(\hat{\theta}\) estimates \(\theta\), bias is \(E[\hat{\theta}] - \theta\).
When confounding variables are omitted:
- Estimates become biased (\(E[\hat{\beta} \neq \beta\))
- Variance estimates are also distorted
Expected Value Rules (Probability Foundations):
For random variables \(X,Y\) and constants \(a,b\) (aka Linearity of Expectation):
\[ \begin{aligned} E[X + Y] &= E[X] + E[Y] \\ E[aX] &= aE[X] \\ E[b] &= b \end{aligned} \]
Proof: OLS Unbiasedness (Correct Model)
Model: \(y = X\beta + \epsilon\), where \(E[\epsilon] = 0\)
Estimator: \(\hat{\beta} = (X^TX)^{-1}X^Ty\)
\[ \begin{align*} E[\hat{\beta}] &= E[(X^TX)^{-1}X^Ty] \\ &= (X^TX)^{-1}X^TE[y] \quad \text{(Linearity of expectation)} \\ &= (X^TX)^{-1}X^T X\beta \quad \text{(Since } E[y] = X\beta) \\ &= \beta \quad \end{align*} \]
Proof: OLS Bias (Omitted Variable) <- What would happen to us if we don’t think about Elevation!!!
True Model: \(y = X\beta + Z\gamma + \epsilon\)
Fitted Model: \(y = X\beta + \epsilon^*\) (omits \(Z\))
$$
\[\begin{align*} \hat{\beta} &= (X^TX)^{-1}X^Ty \\ &= (X^TX)^{-1}X^T(X\beta + Z\gamma + \epsilon) \\ &= \beta + (X^TX)^{-1}X^TZ\gamma + (X^TX)^{-1}X^T\epsilon \end{align*}\] $$ Taking expectations:
$$ \[\begin{align*} E[\hat{\beta}] &= \beta + \underbrace{(X^TX)^{-1}X^TZ\gamma}_{\text{Bias term}} \quad \text{(if } X \text{ and } Z \text{ are correlated)} \\ E[\hat{\beta}] &\neq \beta \quad \end{align*}\]
$$
References:
- National Academies of Sciences, Engineering, and Medicine. 2023. Using Population Descriptors in Genetics and Genomics Research: A New Framework for an Evolving Field. Washington, DC: The National Academies Press. https://doi.org/10.17226/26902.
- Bryc, K., Durand, E. Y., Macpherson, J. M., Reich, D., & Mountain, J. L. (2015). The Genetic Ancestry of African Americans, Latinos, and European Americans across the United States. The American Journal of Human Genetics, 96(1), 37-53. https://doi.org/10.1016/j.ajhg.2014.11.010
- Novembre, J., Johnson, T., Bryc, K., Kutalik, Z., Boyko, A. R., Auton, A., Indap, A., King, K. S., Bergmann, S., Nelson, M. R., Stephens, M., & Bustamante, C. D. (2008). Genes mirror geography within Europe. Nature, 456(7218), 98-101. https://doi.org/10.1038/nature07331
- Price, A. L., Zaitlen, N. A., Reich, D., & Patterson, N. (2010). New approaches to population stratification in genome-wide association studies. Nature Reviews. Genetics, 11(7), 459–463. https://doi.org/10.1038/nrg2813
- Simonson, T. S., Yang, Y., Huff, C. D., Yun, H., Qin, G., Witherspoon, D. J., Bai, Z., Lorenzo, F. R., Xing, J., Jorde, L. B., Prchal, J. T., & Ge, R. (2010). Genetic Evidence for High-Altitude Adaptation in Tibet. Science, 329(5987), 72-75. https://doi.org/10.1126/science.1189406
- Stebbins, G. L. (1950). Variation and Evolution in Plants. Columbia University Press. https://doi.org/10.7312/steb94536
- Zheng, L., Gaire, N. P., & Shi, P. (2021). High-altitude tree growth responses to climate change across the Hindu Kush Himalaya. Journal of Plant Ecology, 14(5), 829-842. https://doi.org/10.1093/jpe/rtab035